Speech-to-Text Conversion

Techniques, Tools, and Trade-offs

Omkar Ninav
June 30, 2025

What is Speech-to-Text?

  • Converts spoken language into written text
  • Also known as Automatic Speech Recognition (ASR)
  • Used in applications like:
    • Virtual assistants (e.g., Siri, Alexa)
    • Captioning & subtitling
    • Meeting transcription
    • Voice commands

How Speech-to-Text Works

  1. Audio Input
    Voice signal captured from a microphone or file

  2. Feature Extraction
    Converts raw audio into numerical features (e.g., MFCCs, spectrograms)

  3. Acoustic Model
    Maps features to phonemes or characters using ML/DL models

  4. Language Model
    Predicts word sequences for context-aware transcription

  5. Decoder
    Searches for the most likely word sequence by combining acoustic and language model scores

```mermaid
flowchart LR
  A[Audio Input<br/>Voice signal] --> B[Feature Extraction<br/>MFCCs / Spectrograms]
  B --> C[Acoustic Model<br/>ML/DL-based]
  C --> D[Language Model<br/>Contextual Prediction]
  D --> E[Decoder<br/>Final Text Output]
```
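Step 2 of the pipeline (feature extraction) can be sketched with plain NumPy: frame the waveform, window each frame, and take the squared magnitude of the FFT to obtain a spectrogram. A real system would typically continue to mel filterbanks and MFCCs (e.g. via `librosa`); the sample rate, frame length, and hop size below are illustrative choices, not fixed requirements.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and return the power
    spectrum of each frame (shape: frames x frequency bins)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

# One second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): 98 frames, 201 frequency bins
```

With a 400-sample frame at 16 kHz, each FFT bin spans 40 Hz, so the 440 Hz tone shows up as a peak in bin 11 — the kind of compact numerical representation the acoustic model consumes.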

Techniques: Traditional ASR

  • HMMs + GMMs (Hidden Markov Models + Gaussian Mixture Models)
    Early statistical models for mapping acoustic features to phonemes

  • n-gram Language Models
    Predict word sequences based on prior word probabilities

  • Feature Extraction (MFCCs)
    Converts raw audio into spectral features (Mel-Frequency Cepstral Coefficients)
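The n-gram idea can be shown in a few lines: a bigram model estimates P(word | previous word) from raw counts over a corpus. The three-sentence corpus here is made up purely for illustration.

```python
from collections import Counter

corpus = ["i like speech", "i like text", "speech to text"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "like"))       # 1.0 ("i" is always followed by "like")
print(bigram_prob("like", "speech"))  # 0.5 ("like" precedes "speech" and "text" equally)
```

Production n-gram models add sentence-boundary tokens and smoothing for unseen pairs; this sketch shows only the core counting step.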

Python Libraries for Speech-to-Text (1/2)

  • Whisper (OpenAI)
    • Transformer-based, multilingual
    • High accuracy, runs offline
  • SpeechRecognition
    • Simple API wrapper for Google, IBM, etc.
    • Easy for beginners
  • Wav2Vec 2.0 (Hugging Face)
    • Pretrained self-supervised models
    • High-quality transcriptions

Python Libraries for Speech-to-Text (2/2)

  • DeepSpeech (Mozilla)
    • Lightweight and fast
    • Less accurate on noisy inputs
  • Kaldi (via PyKaldi)
    • Research-grade toolkit
    • Steeper learning curve
  • Vosk
    • Real-time offline STT
    • Works on Raspberry Pi, Android, desktops
    • Multilingual, very lightweight

Model Comparison: Features at a Glance

| Model       | Accuracy | Offline | Multilingual        | Ease of Use | Cost |
|-------------|----------|---------|---------------------|-------------|------|
| Whisper     | ✅✅✅   | ✅      | ✅✅✅              | ✅✅        | Free |
| Wav2Vec 2.0 | ✅✅     | ✅      | ⚠️ (mostly English) | ✅✅        | Free |
| Google API  | ✅✅✅   | ❌      | ✅✅✅              | ✅✅✅      | Paid |
| DeepSpeech  | ✅✅     | ✅      | ⚠️ (mostly English) | ✅✅        | Free |
| Kaldi       | ✅✅     | ✅      | ✅ (with effort)    | ⚠️ Complex  | Free |
| Vosk        | ✅✅     | ✅      | ✅✅                | ✅✅✅      | Free |

Cost: Open-Source Models

| Model       | Cost               | Offline Use | Cloud Required |
|-------------|--------------------|-------------|----------------|
| Whisper     | Free (Open Source) | ✅          | ❌             |
| Wav2Vec 2.0 | Free (Open Source) | ✅          | ❌             |
| DeepSpeech  | Free (Open Source) | ✅          | ❌             |
| Kaldi       | Free (Open Source) | ✅          | ❌             |
| Vosk        | Free (Open Source) | ✅          | ❌             |

Cost: Cloud APIs

| Service           | Approx. Cost     | Offline Use | Cloud Required |
|-------------------|------------------|-------------|----------------|
| Google Speech API | ~$1.44 per hour  | ❌          | ✅             |
| AWS Transcribe    | ~$1.44 per hour  | ❌          | ✅             |
| Azure Speech      | ~$1.60 per hour  | ❌          | ✅             |
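At pay-per-hour rates, the monthly bill scales linearly with audio volume, which is the heart of the cost trade-off. A quick back-of-the-envelope calculation using the approximate rates above (real cloud pricing has tiers and free quotas, so treat these as rough figures):

```python
def monthly_cost(hours_per_month, rate_per_hour):
    """Rough monthly bill for a pay-per-hour transcription API."""
    return hours_per_month * rate_per_hour

# 100 hours of audio per month at the approximate per-hour rates above.
for service, rate in [("Google / AWS", 1.44), ("Azure", 1.60)]:
    print(f"{service}: ${monthly_cost(100, rate):.2f}/month")
# An open-source model incurs $0 in API fees for the same volume
# (at the cost of local compute and setup effort).
```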

Cost – Summary & Recommendation

  • Open-Source Models (Whisper, Wav2Vec, etc.)
    • ✅ Free and offline
    • ⚠️ Require setup and local resources
  • Cloud APIs (Google, AWS, Azure)
    • ✅ Easy to use, scalable
    • ⚠️ Ongoing cost and privacy trade-offs

✔️ Recommendation:
Use Whisper or Wav2Vec 2.0 for local, cost-effective transcription
Use Vosk for lightweight, multilingual offline STT (e.g., edge devices)
Use Cloud APIs only for real-time or highly multilingual needs

✅ Final Decision

Chosen Model: openai/whisper-medium

🔹 Why Whisper Medium?

  • ✅ Good balance between accuracy and inference speed
  • ✅ Lower VRAM & compute requirements than large models
  • ✅ Suitable for local deployment (runs on RTX 4050)
  • ✅ Consistent performance across test cases
  • ✅ Robust to mild accents and moderate background noise

🔸 Not Chosen:

  • whisper-large: Higher accuracy but heavier (more VRAM)
  • whisper-tiny / base: Faster, but less accurate
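The deployment reasoning above reduces to a rule of thumb: pick the largest model that fits in available VRAM. The figures below are the approximate requirements listed in the Whisper README (tiny/base ~1 GB, small ~2 GB, medium ~5 GB, large ~10 GB); a 6 GB card such as a laptop RTX 4050 therefore tops out at medium.

```python
# Approximate VRAM needed per Whisper model size (GB), per the Whisper README.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

def pick_whisper_model(available_vram_gb):
    """Return the largest Whisper model that fits the given VRAM budget."""
    fitting = [m for m, need in VRAM_GB.items() if need <= available_vram_gb]
    return max(fitting, key=VRAM_GB.get) if fitting else None

print(pick_whisper_model(6))   # medium (e.g. RTX 4050 laptop GPU)
print(pick_whisper_model(12))  # large
```

Actual memory use also depends on batch size, precision (fp16 vs fp32), and decoding options, so leave some headroom.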

📝 Transcript Evaluation – Whisper vs Original

🎙️ Sample: NPTEL Lecture – Literature History

📌 Original:
> At the outset, this being the first session, it is very important to give an overview of the course. This course is spread over 12 weeks and we may have 30 hours of teaching involved in this. Let me also introduce you to the objectives of this course so that the intentions become clearer.

🌀 Whisper Output:
> At the outset, this being the first session, it is very important to give an overview of the course. This course is spread over 12 weeks and we may have 30 hours of teaching involved in this and we also introduce you to the objectives of this course so that the intentions become more clear.

🔹 ✅ Core content retained
🔹 ⚠️ Minor stylistic variation – rephrasing & joined sentences

🧠 Observation Summary

  • Whisper captured all key points with high fidelity
  • Punctuation differences due to lack of postprocessing
  • Inserted words (“we also”) carried over naturally from the spoken delivery
  • Ideal for lecture summarization, accessibility, or note generation

📊 Whisper Accuracy

🔹 Quantitative Evaluation

  • Semantic Similarity (cosine): 0.9766 ✅
  • ROUGE-1 F1 Score: 0.9714
  • ROUGE-L F1 Score: 0.9524
  • Word Error Rate (WER): 12.43%
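WER, quoted above, is edit distance at the word level: (substitutions + insertions + deletions) divided by the reference length. A minimal implementation, demonstrated on a toy sentence pair rather than the lecture transcript:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 4))  # 0.3333
```

Note that insertions count against WER even when meaning is preserved, which is exactly why the filler-heavy second sample later scores a higher WER despite strong semantic similarity.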

🔍 Interpretation

  • Very high semantic match – meaning fully preserved
  • Low WER – minor word-level issues
  • ROUGE scores confirm strong overlap in phrase structure
  • ⚠️ Differences mostly in punctuation, filler words, and phrasing
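For reference, cosine similarity measures the angle between vector representations of two texts; the 0.9766 above was presumably computed on sentence embeddings, but the same formula applies to simple word-count vectors, as in this sketch:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(cosine_similarity("the cat sat", "the cat sat"), 4))      # 1.0
print(round(cosine_similarity("the cat sat", "dogs bark loudly"), 4)) # 0.0
```

Embedding-based similarity (e.g. sentence-transformer models) captures paraphrases that count vectors miss, which is why it rewards Whisper's rewordings here.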

📝 Transcript Comparison – Whisper vs Original

🎙️ Sample: NPTEL Lecture – Taylor Series Intro

📌 Original:
> Welcome back to the lectures on Engineering Mathematics-I.
> Today, we will learn Taylor’s Polynomial and Taylor Series.

🌀 Whisper Output:
> Hi, welcome back to the lectures on Engineering Mathematics I and today’s we will learn Taylor’s polynomial and Taylor series.

🔹 ✅ Core message preserved
🔹 Minor changes:
- Added “Hi”
- “today’s we” instead of “today we”
- Punctuation differences only

🎙️ Sample: Exponential Function Example

📌 Original:
> The polynomial of degree 0 will simply be 1.
> If we plot this, it’s the green line through the point (0, 1).

🌀 Whisper Output:
> So, the polynomial of degree 0 will be simply 1 and if we plot this. So, this is the green plot here of the exponential function and this polynomial of degree 0 is just a constant line. So, the straight line going through this 0 1 point.

🔹 ✅ Richer phrasing from audio
🔹 Whisper added spontaneous repetitions and fillers (“so”, “here”)
🔹 All technical meaning retained

📊 Whisper Accuracy

🔹 Quantitative Evaluation

  • Semantic Similarity (cosine): 0.8882
  • ROUGE-1 F1 Score: 0.9116
  • ROUGE-L F1 Score: 0.8913
  • Word Error Rate (WER): 26.33%
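ROUGE-1 F1, used in both evaluations, is the harmonic mean of unigram precision and recall with clipped counts. A bare-bones version on a toy pair:

```python
from collections import Counter

def rouge1_f1(reference, hypothesis):
    """ROUGE-1 F1: clipped unigram overlap between hypothesis and reference."""
    ref, hyp = Counter(reference.split()), Counter(hypothesis.split())
    overlap = sum(min(ref[w], hyp[w]) for w in hyp)
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothesis matches 3 of 6 reference words exactly: P = 1.0, R = 0.5, F1 = 2/3.
print(round(rouge1_f1("the cat sat on the mat", "the cat sat"), 4))  # 0.6667
```

ROUGE-L additionally rewards in-order matches via the longest common subsequence, which is why it drops slightly below ROUGE-1 when Whisper reorders phrases.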

🔍 Interpretation

  • Meaning mostly preserved, despite phrasing variation
  • ⚠️ Slight drop in surface-level overlap (function words, rewording)
  • ⚠️ Higher WER due to longer output and added filler words

Conclusion & Summary

  • STT is a mature, versatile technology
  • Open-source tools (Whisper, Wav2Vec) offer high quality with no cost
  • Cloud APIs provide convenience but incur recurring costs
  • Model choice depends on:
    ✅ Accuracy
    ✅ Cost
    ✅ Deployment constraints
    ✅ Privacy needs

✔️ Recommended:
Use Whisper for secure, offline transcription
Use cloud APIs only where real-time & scalability are critical

Thank You!

Questions? Suggestions?

🔗 View this presentation on GitHub:

Presented by Omkar Ninav — June 2025